Automatic human utility evaluation of ASR systems: does WER really predict performance?
Abstract
We propose an alternative evaluation metric to Word Error Rate (WER) for the decision audit task of meeting recordings, which exemplifies how to evaluate speech recognition within a legitimate application context. Using machine learning on an initial seed of human-subject experimental data, our alternative metric handily outperforms WER, which correlates very poorly with human subjects’ success in finding decisions given ASR transcripts with a range of WERs.
Similar resources
End-to-End Evaluation of a Speech-to-Speech Translation System in TC-STAR
The paper describes an evaluation methodology to evaluate speech-to-speech translation systems and their results. The evaluation scheme uses questionnaires filled in by human judges for addressing the adequacy and fluency of audio translation outputs and was applied in the second TC-STAR evaluation campaign. The same evaluation methodology is carried out both on the outputs of an automatic syst...
Using Audio Quality to Predict Word Error Rate in an Automatic Speech Recognition System
Faced with a backlog of audio recordings, users of automatic speech recognition (ASR) systems would benefit from the ability to predict which files would result in useful output transcripts in order to prioritize processing resources. ASR systems used in non-research environments typically run in “real time”. In other words, one hour of speech requires one hour of processing. These systems prod...
How to evaluate ASR output for named entity recognition?
The standard metric to evaluate automatic speech recognition (ASR) systems is the word error rate (WER). WER has proven very useful in stand-alone ASR systems. Nowadays, these systems are often embedded in complex natural language processing systems to perform tasks like speech translation, man-machine dialogue, or information retrieval from speech. This exacerbates the need for the speech proce...
Towards spoken clinical-question answering: evaluating and adapting automatic speech-recognition systems for spoken clinical questions
OBJECTIVE: To evaluate existing automatic speech-recognition (ASR) systems to measure their performance in interpreting spoken clinical questions and to adapt one ASR system to improve its performance on this task. DESIGN AND MEASUREMENTS: The authors evaluated two well-known ASR systems on spoken clinical questions: Nuance Dragon (both generic and medical versions: Nuance Gen and Nuance Med) a...
On the Use of Information Retrieval Measures for Speech Recognition Evaluation
This paper discusses the evaluation of automatic speech recognition (ASR) systems developed for practical applications, suggesting a set of criteria for application-oriented performance measures. The commonly used word error rate (WER), which poses ASR evaluation as a string editing process, is shown to have a number of limitations with respect to these criteria, motivating alternative or addit...
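The abstracts above treat WER as the standard ASR metric, and the last one describes it as a string-editing process. As a point of reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that conventional definition, not the scoring code used in any of the listed papers; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, this quantity can exceed 1.0 when the hypothesis is much longer than the reference.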